@napsternxg
Last active August 10, 2022 03:34
Large-scale dataset? Year of release Name Reference URL to ref URL to data Access Price License Summarization type Language Summaries specifically written for the corpora Need to generate data? domain multi-doc? nb of texts LREC nb of texts nb of texts per topic nb of gold summaries per text to summarize input length output length LREC output length generic? Misc comments
Main summarization corpora
n 2001 DUC 2001 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive English n news both 60x10 600 10 1 50, 100, 200, 400 multi-doc: 50, 100, 200, 400 words; single-doc: 100 words generic? See "DUC in context" Table 1 for more details
n 2002 DUC 2002 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive, extractive English n news both 60x10 600 10 2 10, 50, 100, 200, 400 multi-doc: 10, 50, 100, 200 words; single-doc: 100 words generic?
n 2003 DUC 2003 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive English n news both 60x10, 30x25 624 ~10 1? 10, 100 both
n 2004 DUC 2004 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive English+Arabic n news both 100x10 ~740 ~10 4 10, 100 50, 100, 250 words both
n 2005 DUC 2005 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive English n news y 50x32 25-50 250 250 words query-focused
n 2006 DUC 2006 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive English n news y 50x25 25-50 4 250 words query-focused
n 2007 DUC 2007 ? http://www-nlpir.nist.gov/projects/duc/pubs.html http://www-nlpir.nist.gov/projects/duc/data.html Email request 0 Abstractive English n news y 25x10 100 update http://duc.nist.gov/duc2007/tasks.html
n 2008 TAC 2008 ? https://tac.nist.gov/publications/index.html https://tac.nist.gov/data/index.html Email request 0 Abstractive English n news y 48x20 960 20 100 update,query
n 2009 TAC 2009 ? https://tac.nist.gov/publications/index.html https://tac.nist.gov/data/index.html Email request 0 Abstractive English n news y 44x20 880 20 100 guided https://tac.nist.gov/data/index.html
n 2010 TAC 2010 ? https://tac.nist.gov/publications/index.html https://tac.nist.gov/data/index.html Email request 0 Abstractive English n news y 46x20 920 20 100 guided
n 2011 TAC 2011 ? https://tac.nist.gov/publications/index.html https://tac.nist.gov/data/index.html Email request 0 Abstractive English n news y 44x20 880 20 100 guided
n 2003 ICSI Janin, Adam, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin et al. "The ICSI meeting corpus." In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP'03). 2003 IEEE International Conference on, vol. 1, pp. I-I. IEEE, 2003. https://scholar.google.com/scholar?cluster=734196485602731249&hl=en&as_sdt=0,5 Abstractive, extractive English transcribed meetings n 57 57 3 human abstractive and 3 human extractive summaries are available, of respective average sizes 390 words and 133 utterances. 390
n 2005 AMI McCowan, Iain, Jean Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot et al. "The AMI meeting corpus." In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, vol. 88. 2005. https://scholar.google.com/scholar?cluster=9565292835176993645&hl=en&as_sdt=0,5 Abstractive, extractive English transcribed meetings n 137 137 1 human-written abstractive summary of 300 words on average, and with a human-composed extractive summary (140 utterances on average). 300
n 2010 Opinosis Ganesan, K. A., C. X. Zhai, and J. Han, "Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions", Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), 2010. http://kavita-ganesan.com/opinosis http://kavita-ganesan.com/opinosis-opinion-dataset Publicly available on website 0 Abstractive English y n product reviews y 51x100 ~5100 (51 topics, each containing around 100 sentences) 51 4 human abstracts 1 sentence 25 ~25 words
y 2003 Gigaword Graff, David, and Christopher Cieri. English Gigaword LDC2003T05. Web Download. Philadelphia: Linguistic Data Consortium, 2003. ? https://catalog.ldc.upenn.edu/ldc2003t05 Publicly available on website 3000 Abstractive English n n news n 4111240 4111240 1 Headline y
y 2005 Gigaword 2 Graff, David, et al. English Gigaword Second Edition LDC2005T12. Web Download. Philadelphia: Linguistic Data Consortium, 2005. ? https://catalog.ldc.upenn.edu/LDC2005T12 Publicly available on website 400 Abstractive English news
y 2007 Gigaword 3 Graff, David, et al. English Gigaword Third Edition LDC2007T07. Web Download. Philadelphia: Linguistic Data Consortium, 2007 ? https://catalog.ldc.upenn.edu/LDC2007T07 Publicly available on website 4000 Abstractive English news
y 2009 Gigaword 4 Parker, Robert, et al. English Gigaword Fourth Edition LDC2009T13. Web Download. Philadelphia: Linguistic Data Consortium, 2009. ? https://catalog.ldc.upenn.edu/LDC2009T13 Publicly available on website 5000 Abstractive English news
y 2011 Gigaword 5 Parker, Robert, et al. English Gigaword Fifth Edition LDC2011T07. DVD. Philadelphia: Linguistic Data Consortium, 2011. ? https://catalog.ldc.upenn.edu/ldc2011t07 Publicly available on website 6000 Abstractive English news 9876086 9876086
y 2015 LCSTS LCSTS: A Large Scale Chinese Short Text Summarization Dataset http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP229.pdf http://icrc.hitsz.edu.cn/Article/show/139.html 0 2.The original copyright of all the data of the Large Scale Chinese Short Text Summarization Dataset belongs to writers of the Weiboes, Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School collects, organizes, filters and purifies them. LCSTS is free to the public. 3.If you want to use the dataset for depth study, data providers (Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School) should be identified in your results.4.The dataset is only for the specified applicant or study groups for research purposes. Without permission, it may not be used for any commercial purposes. Abstractive Chinese n n Chinese microblogging website SinaWeibo n 2400591 2400591 1 short text even shorter text y Also contains 10,666 human labeled (short text, summary) pairs, the score ranges from 1 to 5 which indicates the relevance between the short text and the corresponding summary, as well as 1,106 pairs which are scored by 3 persons simultaneously.
y 2015 CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Teaching+machines+to+read+and+comprehend.&btnG= 0 Abstractive English n y news n 312084 312084 1 typical news article a few sentences y
y 2016 MSR Abstractive Text Compression Dataset A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs. Kristina Toutanova, Chris Brockett, Ke M. Tran, and Saleema Amershi, EMNLP 2016 https://scholar.google.com/scholar?cluster=11978909955936947219&hl=en&as_sdt=0,5 https://www.microsoft.com/en-us/download/details.aspx?id=54262 Publicly available on website 0 Abstractive English y n business letters, newswire, journals, and technical documents sampled from the Open American National Corpus (OANC). n 6000 6000 26000/6000 two-sentence paragraphs
Less commonly used datasets
LREC 2016 A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization http://www.lrec-conf.org/proceedings/lrec2016/pdf/366_Paper.pdf Email request 0 Indonesian 300 chat logs 3
LREC 2014 Building a Dataset for Summarization and Keyword Extraction from Emails http://www.lrec-conf.org/proceedings/lrec2014/pdf/1037_Paper.pdf Email request 0 English 349 emails and threads have been annotated. 100k words.
LREC 2014 A Repository of State of the Art and Competitive Baseline Summaries for Generic News Summarization http://www.lrec-conf.org/proceedings/lrec2014/pdf/1093_Paper.pdf 0 English Automatically generated from 2004 DUC using summarization systems
LREC 2014 Priberam Compressive Summarization Corpus: A New Multi-Document Summarization Corpus for European Portuguese http://www.lrec-conf.org/proceedings/lrec2014/pdf/187_Paper.pdf 0 European Portuguese 100 words
LREC 2014 LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization http://www.lrec-conf.org/proceedings/lrec2014/pdf/578_Paper.pdf 0 English Automatically generated from TAC 2011, with summarization system output errors annotated
LREC 2010 A French Human Reference Corpus for Multi-Document Summarization and Sentence Compression https://aclanthology.coli.uni-saarland.de/papers/L10-1626/a-french-human-reference-corpus-for-multi-document-summarization-and-sentence-compression 0 French see abstract
EACL 2017 Ouyang, Jessica, Serina Chang, and Kathleen McKeown. "Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora." EACL 2017 (2017): 46. http://www.aclweb.org/anthology/E17-2008 http://www.cs.columbia.edu/~ouyangj/aligned-summarization-data/ 0 Abstractive and extractive English 476
A Publicly Available Annotated Corpus for Supervised Email Summarization
Sentence compression
2013 Google sentence compression Overcoming the Lack of Parallel Data in Sentence Compression, Katja Filippova and Yasemin Altun, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP '13), pp. 1481-1491. http://www.aclweb.org/anthology/D/D13/D13-1155.pdf https://github.com/google-research-datasets/sentence-compression?files=1 0 Compressive English 250k
2015 Google sentence compression 2 Sentence Compression by Deletion with LSTMs. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sentence+Compression+by+Deletion+with+LSTMs.&btnG= 0 ~2M (but only 10k released?)
Sentence simplification
2017 The WebSplit Benchmark https://arxiv.org/pdf/1707.06971.pdf 0 1066115
Special kinds of summarization
EMNLP 2017 https://aclanthology.coli.uni-saarland.de/papers/D17-1223/d17-1223 Overview
EMNLP 2017 https://aclanthology.coli.uni-saarland.de/papers/D17-1322/d17-1322 Concept maps
ACL 2013 CMU Movie Summary Corpus Learning Latent Personas of Film Characters. David Bamman, Brendan O'Connor, and Noah A. Smith. ACL 2013, Sofia, Bulgaria, August 2013 http://www.cs.cmu.edu/~ark/personas/ Publicly available on website 0 CC-BY-SA 3.0 Abstractive English 42306
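
Several rows give the corpus size in an "AxB" notation next to a total, e.g. 60x10 with 600 for DUC 2001 and 48x20 with 960 for TAC 2008. A minimal sketch of that arithmetic follows, assuming the notation means "A document sets times B documents per set"; the table itself does not define it, and the function name `total_texts` is made up for this illustration. The listed counts for DUC 2003 and DUC 2004 do not appear to equal the plain product, so treat this as a consistency check rather than a rule.

```python
def total_texts(shape: str) -> int:
    """Return the sum of A*B over comma-separated 'AxB' terms.

    Assumption: 'AxB' means A document sets with B documents each.
    """
    total = 0
    for term in shape.split(","):
        sets, docs_per_set = term.strip().split("x")
        total += int(sets) * int(docs_per_set)
    return total


# Rows from the table where a total is listed next to the AxB value.
rows = [
    ("DUC 2001", "60x10", 600),
    ("TAC 2008", "48x20", 960),
    ("TAC 2009", "44x20", 880),
    ("TAC 2010", "46x20", 920),
]

for name, shape, listed in rows:
    computed = total_texts(shape)
    status = "matches" if computed == listed else "differs from"
    print(f"{name}: {shape} -> {computed} ({status} listed {listed})")
```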